Welcome back to Pattern Recognition. Today we want to look a bit more into the EM algorithm, and in particular we want to learn how to apply it to other problems. We will start with the so-called missing information principle.
The missing information principle is as simple as this: the observable information is given as the complete information minus the hidden information. Let's look at this in a slightly more mathematical way. We formalize the entire problem with the observable random variable x, the hidden random variable y, and the parameter set θ. We can use this to factorize the joint probability density of the observable part x and the hidden part y: p(x, y; θ) can be expressed as p(x; θ) times p(y | x; θ). Rearranging this, we see that p(x; θ) is given as the fraction p(x, y; θ) divided by p(y | x; θ). Now we apply the negative logarithm and obtain that minus the logarithm of p(x; θ) equals minus the logarithm of p(x, y; θ) minus the negative logarithm of p(y | x; θ), which is exactly the missing information principle expressed in terms of negative log-likelihoods. Next, let's rewrite this as an iteration scheme. We write the same equation for iteration i+1, that is, in the same form but with an additional iteration index, where θ̂^(i) denotes the estimate of the parameter vector θ in the i-th iteration step. Then we multiply both sides with p(y | x; θ̂^(i)), the probability of the hidden event y given the observation and the parameter vector of the i-th iteration, and we integrate over the entire hidden event y. So we introduce the integral and multiply with this additional probability.
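To make these steps easier to follow, here is one way to write out the derivation up to this point; the notation, with the semicolon separating the parameters and the hat with an iteration superscript for the estimates, mirrors what is described verbally in the lecture:

\[
  p(x, y; \theta) = p(x; \theta)\, p(y \mid x; \theta)
  \qquad\Longrightarrow\qquad
  -\log p(x; \theta) = -\log p(x, y; \theta) - \bigl(-\log p(y \mid x; \theta)\bigr).
\]
Written for the parameter estimate of iteration \(i+1\), then multiplied with \(p(y \mid x; \hat{\theta}^{(i)})\) and integrated over \(y\), this becomes
\[
  -\!\int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(x; \hat{\theta}^{(i+1)}\bigr)\, dy
  = -\!\int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(x, y; \hat{\theta}^{(i+1)}\bigr)\, dy
  + \int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(y \mid x; \hat{\theta}^{(i+1)}\bigr)\, dy .
\]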
On the left-hand side this gives an integral, and on the right-hand side we can split the expression into two integrals, each containing p(y | x; θ̂^(i)) times one of the logarithms from above. You can already see that this looks kind of familiar. Let's look a little more closely at the left-hand side. If we rearrange it a bit, we can pull the logarithm of p(x; θ̂^(i+1)) out of the integral, because it does not depend on y, and the integral of p(y | x; θ̂^(i)) over the entire domain of y is just going to be 1. So we are left with the logarithm of p(x; θ̂^(i+1)). We can therefore observe that the left-hand side of the equation is the log-likelihood function of our observations. We encountered this term previously, and we can now use it to express the log-likelihood function.
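As a short sketch of this simplification, the left-hand side collapses because the logarithm does not depend on y:

\[
  -\!\int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(x; \hat{\theta}^{(i+1)}\bigr)\, dy
  = -\log p\bigl(x; \hat{\theta}^{(i+1)}\bigr)
    \underbrace{\int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr)\, dy}_{=\,1}
  = -\log p\bigl(x; \hat{\theta}^{(i+1)}\bigr),
\]
so up to the sign the left-hand side is exactly the log-likelihood of the observation under the new parameter estimate.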
The maximization of the right-hand side of our key equation therefore corresponds to a maximum likelihood estimation. If we look at the terms on the right-hand side, we can introduce the following notation. Formally this is a little bit imprecise, because it involves a small alteration of the iteration indices, but it gives rise to the Kullback-Leibler statistics: Q(θ̂^(i), θ̂^(i+1)) is given as the integral over y of p(y | x; θ̂^(i)) times the logarithm of p(x, y; θ̂^(i+1)). Furthermore, we obtain the entropy H(θ̂^(i), θ̂^(i+1)), defined as the negative integral over y of p(y | x; θ̂^(i)) times the logarithm of p(y | x; θ̂^(i+1)), i.e., the conditional probability of the hidden event under the parameter vector of the next iteration. Now let's first have a closer look at the Kullback-Leibler statistics. If we write out Q(θ, θ′) for general parameter vectors as in the equation above, we see that the Kullback-Leibler statistics, which is also called the Q-function, is nothing else than the conditional expectation of the log-likelihood of the joint probability of x and y, taken over y given the observation x and the parameters θ and evaluated at θ′.
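Written out in formulas, the two terms just described and the expectation view of the Q-function read as follows (the expectation notation in the last line is simply a compact restatement of the integral):

\[
  Q\bigl(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}\bigr)
  = \int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(x, y; \hat{\theta}^{(i+1)}\bigr)\, dy ,
  \qquad
  H\bigl(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}\bigr)
  = -\!\int p\bigl(y \mid x; \hat{\theta}^{(i)}\bigr) \log p\bigl(y \mid x; \hat{\theta}^{(i+1)}\bigr)\, dy ,
\]
and for general parameter vectors
\[
  Q(\theta, \theta')
  = \int p(y \mid x; \theta) \log p(x, y; \theta')\, dy
  = \operatorname{E}\bigl[\log p(x, y; \theta') \,\bigm|\, x; \theta\bigr].
\]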
Now let's have a look again at the key equation. We can write it down as: the log-likelihood function of x equals the Q statistics plus the entropy of the old and the new parameter set, i.e., log p(x; θ̂^(i+1)) = Q(θ̂^(i), θ̂^(i+1)) + H(θ̂^(i), θ̂^(i+1)). So now we want to motivate that the maximization of the Kullback-Leibler statistics can actually replace the maximization of the log-likelihood function. Note that we only look at this on a rather coarse level; a complete proof can be found in the literature if you have a look at the further readings.
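The lecture stays at this conceptual level, but to see how the two sides of the key equation are used in practice, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture in Python. The example, the variable names, and the closed-form M-step updates are my own illustration and are not taken from the lecture; the E-step computes the posterior p(y | x; θ̂^(i)) of the hidden component label, and the M-step chooses the parameters that maximize the corresponding Q-function.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D observations; the hidden variable y is the (unknown) component label.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guess theta^(0): mixture weights, means, and standard deviations.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])

def gaussian(x, mu, sigma):
    # Univariate normal density, vectorized over samples and components.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

for i in range(50):
    # E-step: posterior p(y | x; theta^(i)) of the hidden label for every sample.
    joint = weights * gaussian(x[:, None], means, stds)   # p(x, y; theta^(i)), shape (N, 2)
    resp = joint / joint.sum(axis=1, keepdims=True)        # responsibilities

    # M-step: parameters maximizing Q(theta^(i), theta), the expected
    # complete-data log-likelihood under the posterior from the E-step.
    n_k = resp.sum(axis=0)
    weights = n_k / x.shape[0]
    means = (resp * x[:, None]).sum(axis=0) / n_k
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k)

# Log-likelihood of the observations under the final parameter estimate.
log_likelihood = np.log((weights * gaussian(x[:, None], means, stds)).sum(axis=1)).sum()
print("weights:", weights, "means:", means, "std devs:", stds)
print("log-likelihood of the observations:", log_likelihood)

Because the hidden variable y is a discrete component label in this sketch, the integral over y in the Q-function and in the entropy term becomes a finite sum, and the responsibilities play the role of p(y | x; θ̂^(i)).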
In this video, we analyze the expectation-maximization algorithm and relate it to Kullback-Leibler statistics in the context of the missing information principle.
This video is released under CC BY 4.0. Please feel free to share and reuse.
Music Reference: Damiano Baldoni - Thinking of You